refactor(installer): unify host preflight and thin deploy compatibility #1470
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Refactors onboarding preflight to use a new host assessment/remediation API, extracts the legacy Brev deploy/bootstrap into a dedicated deploy module, and deprecates the `setup-spark` flow.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User as User
participant CLI as nemoclaw.js
participant Preflight as assessHost()
participant Planner as planHostRemediation()
participant Deploy as executeDeploy()
participant Brev as brev CLI
participant VM as Remote VM
participant Installer as scripts/install.sh
User->>CLI: run onboard or deploy
CLI->>Preflight: assessHost()
Preflight-->>CLI: HostAssessment (runtime, dockerReachable, notes)
CLI->>Planner: planHostRemediation(HostAssessment)
Planner-->>CLI: remediation actions
alt blocking remediation present
CLI-->>User: print actions and exit (onboarding skipped)
else deploy path
CLI->>Deploy: executeDeploy(instanceName, env, helpers)
Deploy->>Brev: brev ls --json
alt instance missing
Deploy->>Brev: brev create <instance>
end
Deploy->>Brev: brev refresh / poll brev ls --json
Brev-->>Deploy: instance ready
Deploy->>VM: wait for SSH
Deploy->>VM: rsync rootDir -> remote
Deploy->>VM: scp .env
Deploy->>VM: run Installer (bash scripts/install.sh --non-interactive)
Installer-->>VM: run remote preflight and onboard
alt connect allowed
Deploy->>VM: openshell sandbox connect (interactive)
else
Deploy-->>User: print SSH connection instructions
end
else onboard-only path
CLI->>Installer: run local installer/onboard
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
…ification

# Conflicts:
#	scripts/setup-spark.sh
#	spark-install.md
🚀 Docs preview ready!
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/install-preflight.test.js (1)
738-752: ⚠️ Potential issue | 🟡 Minor

Assert the installer still exits successfully in this skip-onboarding path.
Right now the test only checks the message and the missing log file. A non-zero exit that prints the same text would still pass, which weakens the regression coverage for the intended success case.
Suggested fix
```diff
  const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
    cwd: path.join(import.meta.dirname, ".."),
    encoding: "utf-8",
    env: {
      ...process.env,
@@
    },
  });
+ expect(result.status).toBe(0);
  expect(`${result.stdout}${result.stderr}`).toContain("Skipping onboarding");
  expect(fs.existsSync(onboardLog)).toBe(false);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 738 - 752, The test currently only asserts the installer output and absence of the onboard log but doesn't verify the process exit code; update the assertion after the spawnSync call (the variable result from spawnSync using INSTALLER) to assert the installer exited successfully (e.g., expect(result.status).toBe(0) or equivalent success condition) so non-zero exits that print the same message will fail the test.

docs/deployment/deploy-to-remote-gpu.md (1)
30-35: ⚠️ Potential issue | 🟠 Major

The quick start skips the setup flow the page just recommended.

Lines 26-27 say the preferred path is to install NemoClaw on the remote host and run `nemoclaw onboard`, but this example jumps straight to `nemoclaw my-assistant connect` and `openclaw tui`. Those commands only work after onboarding has already created a sandbox, so a fresh Brev VM cannot follow this section as written.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/deployment/deploy-to-remote-gpu.md` around lines 30 - 35, The example jumps straight to running "nemoclaw my-assistant connect" and "openclaw tui" but those commands require an existing sandbox created by onboarding; update the paragraph to either (a) explicitly state this is for already-onboarded Brev instances, or (b) prepend the onboarding step by instructing users to run "nemoclaw onboard" (or the recommended remote-host install + "nemoclaw onboard") before "nemoclaw my-assistant connect" and "openclaw tui", and add a short note about the prerequisite sandbox so fresh VMs can follow the flow.
🧹 Nitpick comments (4)
docs/reference/troubleshooting.md (1)
135-135: Keep one sentence per source line in this paragraph.

Line 135 currently packs two sentences onto one source line. Split them so the docs stay diff-friendly and consistent with the rest of `docs/`.

Suggested fix

```diff
-If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.
+If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only.
+If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding.
```

As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/troubleshooting.md` at line 135, The paragraph in docs/reference/troubleshooting.md currently has two sentences on a single source line; edit that source line so each sentence is on its own line (i.e., split "If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. If onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding." into two separate source lines) to follow the one-sentence-per-line guideline and keep diffs clean.

test/install-preflight.test.js (1)
570-598: Drop the overwritten `docker` stub in this test setup.

The first `writeExecutable(..., "docker")` is immediately replaced by the second one, so half of this setup never runs. Keeping only the failing stub makes the scenario unambiguous.

Suggested fix

```diff
- writeExecutable(
-   path.join(fakeBin, "docker"),
-   `#!/usr/bin/env bash
-if [ "$1" = "info" ]; then
-  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
-  exit 0
-fi
-exit 0
-`,
- );
  writeExecutable(
    path.join(fakeBin, "openshell"),
    `#!/usr/bin/env bash
if [ "$1" = "--version" ]; then
  echo "openshell 0.0.22"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 570 - 598, The test sets two different docker stubs via writeExecutable(...) with target "docker", and the first one is immediately overwritten by the second; remove the initial successful docker stub block so only the failing docker stub remains (keep the openshell stub intact), i.e. ensure there is a single writeExecutable call for "docker" (the one that exits 1 for "info") to make the scenario unambiguous.

docs/reference/commands.md (1)
212-213: Use active voice in this deprecation warning.

Line 213 says the installer/onboard flow "should be used instead," which is passive. Address the reader directly here.
As per coding guidelines, "Active voice required. Flag passive constructions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/commands.md` around lines 212 - 213, Replace the passive phrasing "the standard installer and `nemoclaw onboard` flow should be used instead" with an active imperative addressing the reader; reword the sentence about `nemoclaw setup-spark` deprecation so it reads something like: "Use the standard installer and run `nemoclaw onboard` instead, because current OpenShell releases handle the older DGX Spark cgroup behavior." Update the sentence that mentions `nemoclaw setup-spark` and `nemoclaw onboard` to this active construction.

test/e2e/brev-e2e.test.js (1)
68-112: Good retry/cleanup logic with a minor suggestion.

The `deleteBrevInstance` helper implements robust retry logic with a `brev refresh` fallback. One observation: the function returns `false` when deletion fails after all attempts, but the calling code at line 685-687 throws if deletion fails. This is consistent.

However, consider that if `hasBrevInstance` returns `true` but `brev("delete", ...)` succeeds without throwing, you immediately call `brev("refresh")` and then check `hasBrevInstance` again. If the Brev API has eventual consistency, the instance might still appear in the list briefly after deletion succeeds. The current retry loop handles this gracefully, but adding a small delay after successful delete (before the existence check) might reduce unnecessary retries.

💡 Optional: Add small delay after successful delete

```diff
  try {
    brev("delete", instanceName);
+   sleep(2); // Allow Brev API to propagate deletion
  } catch {
    // Best-effort delete. We'll verify via ls below and retry if needed.
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/brev-e2e.test.js` around lines 68 - 112, The deleteBrevInstance retry loop can spur unnecessary retries when brev("delete", instanceName) succeeds but the API is eventually consistent; after calling brev("delete", instanceName) inside deleteBrevInstance, insert a short sleep (use the existing sleep(seconds) helper, e.g., 1–3 seconds) before calling brev("refresh") and re-checking hasBrevInstance(instanceName) so the list has time to converge; keep the existing try/catch around brev("delete") and brev("refresh") and preserve the attempts/intervalSeconds behavior.
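The delay-then-recheck pattern above can be sketched as a standalone helper. This is a minimal illustration, not the repo's code: `brev` and `hasBrevInstance` are injected stand-ins for the test file's helpers, and the delay and attempt counts are assumptions.

```javascript
// Sketch of the reviewer's suggestion: after a delete that appears to
// succeed, pause briefly so an eventually consistent `brev ls` has time
// to converge before re-checking existence.
function sleepSeconds(seconds) {
  const end = Date.now() + seconds * 1000;
  while (Date.now() < end) {
    // Busy-wait, mirroring a synchronous e2e helper.
  }
}

function deleteWithConvergence(name, helpers, attempts = 5, delaySeconds = 2) {
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      helpers.brev("delete", name);
      sleepSeconds(delaySeconds); // Let the backend propagate the deletion.
    } catch {
      // Best-effort delete; the existence check below decides.
    }
    try {
      helpers.brev("refresh");
    } catch {
      // Refresh is advisory only.
    }
    if (!helpers.hasBrevInstance(name)) {
      return true; // The listing no longer shows the instance.
    }
  }
  return false; // Caller decides whether a leftover instance is fatal.
}
```

Returning a boolean instead of throwing keeps the same contract the review describes, where the caller chooses to throw on a failed delete.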
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/brev-setup.sh`:
- Around line 28-35: The script currently unconditionally fails if
NVIDIA_API_KEY is missing; change the check so the key is only required when the
effective provider (NEMOCLAW_PROVIDER) is the GPU-focused "build" provider (or
absent/default to "build"). Concretely, replace the unconditional check of
NVIDIA_API_KEY with a conditional that reads NEMOCLAW_PROVIDER (or uses the
default "${NEMOCLAW_PROVIDER:-build}") and only invokes fail "NVIDIA_API_KEY not
set" when that provider equals "build"; leave the installer existence check ([
-f "$INSTALLER_SCRIPT" ] || fail ...) and the exports (NEEDRESTART_MODE,
DEBIAN_FRONTEND, NEMOCLAW_* ) unchanged so onboarding for
openai/anthropic/gemini or endpoint providers can proceed without an
NVIDIA_API_KEY.
In `@scripts/install.sh`:
- Around line 924-956: The preflight is mixing informational notes
(host.runtime, host.notes) with actionable warnings so the script prints "Host
preflight found warnings." even when planHostRemediation(host) returned no
actions; split the output into two buffers (e.g., info_lines and action_lines)
so Detected container runtime and WSL messages go into info_lines and only
remediation items from actions/blockingActions go into action_lines, only set
status/exit code based on blockingActions and only print the warning header
("Host preflight found warnings." / remediation list) when action_lines is
non-empty; keep printing info_lines unconditionally so runtime/WSL remain
visible without triggering the warning path.
In `@spark-install.md`:
- Around line 5-14: The document is now internally inconsistent: the
intro/prereqs claim current OpenShell no longer needs the Spark-specific cgroup
workaround and that the NemoClaw installer/onboarding will install OpenShell
automatically, but later sections still instruct manual OpenShell installation
and the old setup-spark/default-cgroupns-mode=host fix; update the Quick Start
and "What's Different on Spark" sections to match the new supported path by
removing or revising any instructions that tell users to run setup-spark or set
default-cgroupns-mode=host and by deleting manual OpenShell install steps, and
if necessary replace them with a short note pointing to the onboarding installer
behavior (referencing the "Quick Start" and "What's Different on Spark" headings
and the phrases "setup-spark" and "default-cgroupns-mode=host" and "OpenShell
CLI") so the page consistently documents only the automatic onboarding/install
flow.
In `@src/lib/deploy.ts`:
- Around line 240-248: The current existence check uses substring matching
(out.includes(name)) which can falsely match names like "foo" in "foo-prod" or
in error text; update the check after calling execFileSync("brev", ["ls"]) to
split the output into lines (e.g., out.split(/\r?\n/)), trim each line and test
for an exact equality to name (use lines.some(line => line.trim() === name)) to
set exists; apply the same exact-line check to err.stdout and err.stderr instead
of includes so only an exact instance name marks existence (referencing the
exists variable and the out/err handling in this try/catch block).
- Around line 294-297: Replace the insecure SSH/SCP option
StrictHostKeyChecking=no used in the execFileSync calls (the SSH checks invoking
"ssh" and the scp upload of the .env) with a safer host-key policy: either Point
SSH to a pinned known_hosts file via -o UserKnownHostsFile=/path/to/known_hosts
and remove StrictHostKeyChecking=no, or use TOFU by setting -o
StrictHostKeyChecking=accept-new; apply this change to every execFileSync
invocation that runs "ssh" or "scp" (the calls that pass name and the one that
uploads the .env) and ensure the commands also reference the chosen known_hosts
file when appropriate.
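The exact-line matching fix described for `src/lib/deploy.ts` reduces to a small predicate over the captured `brev ls` output. This is a hedged sketch with a hypothetical helper name; the real code wraps `execFileSync("brev", ["ls"])` and its error object.

```javascript
// Sketch of the exact-match check the review asks for: an instance counts
// as existing only when some line of the output, trimmed, equals the name
// exactly. A substring check would wrongly match "foo" inside "foo-prod"
// or inside error text.
function instanceListedExactly(output, name) {
  return output.split(/\r?\n/).some((line) => line.trim() === name);
}
```

Applying the same predicate to `err.stdout` and `err.stderr` keeps the error path consistent with the success path.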
---
Outside diff comments:
In `@docs/deployment/deploy-to-remote-gpu.md`:
- Around line 30-35: The example jumps straight to running "nemoclaw
my-assistant connect" and "openclaw tui" but those commands require an existing
sandbox created by onboarding; update the paragraph to either (a) explicitly
state this is for already-onboarded Brev instances, or (b) prepend the
onboarding step by instructing users to run "nemoclaw onboard" (or the
recommended remote-host install + "nemoclaw onboard") before "nemoclaw
my-assistant connect" and "openclaw tui", and add a short note about the
prerequisite sandbox so fresh VMs can follow the flow.
In `@test/install-preflight.test.js`:
- Around line 738-752: The test currently only asserts the installer output and
absence of the onboard log but doesn't verify the process exit code; update the
assertion after the spawnSync call (the variable result from spawnSync using
INSTALLER) to assert the installer exited successfully (e.g.,
expect(result.status).toBe(0) or equivalent success condition) so non-zero exits
that print the same message will fail the test.
---
Nitpick comments:
In `@docs/reference/commands.md`:
- Around line 212-213: Replace the passive phrasing "the standard installer and
`nemoclaw onboard` flow should be used instead" with an active imperative
addressing the reader; reword the sentence about `nemoclaw setup-spark`
deprecation so it reads something like: "Use the standard installer and run
`nemoclaw onboard` instead, because current OpenShell releases handle the older
DGX Spark cgroup behavior." Update the sentence that mentions `nemoclaw
setup-spark` and `nemoclaw onboard` to this active construction.
In `@docs/reference/troubleshooting.md`:
- Line 135: The paragraph in docs/reference/troubleshooting.md currently has two
sentences on a single source line; edit that source line so each sentence is on
its own line (i.e., split "If you are using Podman, NemoClaw warns and
continues, but OpenShell officially documents Docker-based runtimes only. If
onboarding or sandbox lifecycle fails, switch to Docker Desktop, Colima, or
Docker Engine and rerun onboarding." into two separate source lines) to follow
the one-sentence-per-line guideline and keep diffs clean.
In `@test/e2e/brev-e2e.test.js`:
- Around line 68-112: The deleteBrevInstance retry loop can spur unnecessary
retries when brev("delete", instanceName) succeeds but the API is eventually
consistent; after calling brev("delete", instanceName) inside
deleteBrevInstance, insert a short sleep (use the existing sleep(seconds)
helper, e.g., 1–3 seconds) before calling brev("refresh") and re-checking
hasBrevInstance(instanceName) so the list has time to converge; keep the
existing try/catch around brev("delete") and brev("refresh") and preserve the
attempts/intervalSeconds behavior.
In `@test/install-preflight.test.js`:
- Around line 570-598: The test sets two different docker stubs via
writeExecutable(...) with target "docker", and the first one is immediately
overwritten by the second; remove the initial successful docker stub block so
only the failing docker stub remains (keep the openshell stub intact), i.e.
ensure there is a single writeExecutable call for "docker" (the one that exits 1
for "info") to make the scenario unambiguous.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5a93e840-8490-4a99-8aa7-214864ddfd40
📒 Files selected for processing (22)
- README.md
- bin/lib/onboard.js
- bin/nemoclaw.js
- docs/deployment/deploy-to-remote-gpu.md
- docs/get-started/quickstart.md
- docs/reference/commands.md
- docs/reference/troubleshooting.md
- scripts/brev-setup.sh
- scripts/install.sh
- scripts/setup-spark.sh
- scripts/start-services.sh
- spark-install.md
- src/lib/deploy.test.ts
- src/lib/deploy.ts
- src/lib/preflight.test.ts
- src/lib/preflight.ts
- test/cli.test.js
- test/e2e/brev-e2e.test.js
- test/e2e/test-spark-install.sh
- test/install-preflight.test.js
- test/runner.test.js
- test/usage-notice.test.js
💤 Files with no reviewable changes (1)
- scripts/setup-spark.sh
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/install-preflight.test.js (1)
709-755: ⚠️ Potential issue | 🟠 Major

This no longer verifies the non-interactive terms gate.
Because this case doesn't force preflight to pass, the installer can skip onboarding before `show_usage_notice()` runs and still satisfy `/Skipping onboarding/`. That means the test now passes for unrelated Docker/OpenShell state instead of proving the acceptance requirement.

Suggested fix
```diff
  writeNodeStub(fakeBin);
+ writeExecutable(
+   path.join(fakeBin, "docker"),
+   `#!/usr/bin/env bash
+if [ "$1" = "info" ]; then
+  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
+  exit 0
+fi
+exit 0
+`,
+ );
+ writeExecutable(
+   path.join(fakeBin, "openshell"),
+   `#!/usr/bin/env bash
+if [ "$1" = "--version" ]; then
+  echo "openshell 0.0.22"
+  exit 0
+fi
+exit 0
+`,
+ );
  writeNpmStub(
    fakeBin,
    `if [ "$1" = "pack" ]; then
@@
  const result = spawnSync("bash", [INSTALLER, "--non-interactive"], {
@@
  });
- expect(`${result.stdout}${result.stderr}`).toMatch(
-   /Skipping onboarding|--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
- );
+ expect(result.status).not.toBe(0);
+ expect(`${result.stdout}${result.stderr}`).toMatch(
+   /--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/,
+ );
  expect(fs.existsSync(onboardLog)).toBe(false);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 709 - 755, The test no longer guarantees the installer reached the terms gate because preflight can fail earlier; update the test so preflight is forced to succeed and the installer reaches show_usage_notice() before asserting onboarding was skipped. Concretely, in the test around INSTALLER invocation ensure you arrange/stub the preflight checks (or set the env variable(s) the preflight uses) so they return success (so show_usage_notice() is executed), keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE empty to simulate non-acceptance, then assert the output contains the terms-acceptance hints and that onboardLog does not exist; reference the installer invocation via INSTALLER, the show_usage_notice() path, the NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and the onboardLog file to locate where to change the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/preflight.ts`:
- Around line 203-207: detectNvidiaGpu currently returns true just because the
nvidia-smi binary exists; change it to first verify commandExists("nvidia-smi",
runCaptureImpl) and then actually probe the device by running nvidia-smi (e.g.
runCaptureImpl("nvidia-smi -L", { ignoreError: true }) or
runCaptureImpl("nvidia-smi --query-gpu=name --format=csv,noheader", {
ignoreError: true })) and return true only if the probe returns
non-empty/expected output (trimmed) indicating at least one GPU; otherwise
return false. Use the existing runCaptureImpl and commandExists helpers and keep
error handling via the ignoreError option so an absent/failed probe yields false
rather than throwing.
In `@test/e2e/brev-e2e.test.js`:
- Around line 279-285: The pre-cleanup incorrectly assumes
deleteBrevInstance(INSTANCE_NAME) succeeded; change the pre-cleanup to check the
boolean result and throw or retry/log an error when it returns false so we don't
treat a failed delete as success. Also ensure instanceCreated is set as soon as
the VM is actually created (either by returning a created flag from
runLocalDeploy or by setting instanceCreated = true immediately after detecting
creation) so that afterAll() will always attempt deletion even if later
bootstrap steps fail; update runLocalDeploy usage and afterAll cleanup logic to
rely on that explicit created flag.
---
Outside diff comments:
In `@test/install-preflight.test.js`:
- Around line 709-755: The test no longer guarantees the installer reached the
terms gate because preflight can fail earlier; update the test so preflight is
forced to succeed and the installer reaches show_usage_notice() before asserting
onboarding was skipped. Concretely, in the test around INSTALLER invocation
ensure you arrange/stub the preflight checks (or set the env variable(s) the
preflight uses) so they return success (so show_usage_notice() is executed),
keep NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE empty to simulate non-acceptance, then
assert the output contains the terms-acceptance hints and that onboardLog does
not exist; reference the installer invocation via INSTALLER, the
show_usage_notice() path, the NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE env var, and
the onboardLog file to locate where to change the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 80955231-b32a-4dc6-8604-041a7326547d
📒 Files selected for processing (9)
- README.md
- bin/lib/onboard.js
- bin/nemoclaw.js
- docs/reference/commands.md
- scripts/install.sh
- spark-install.md
- src/lib/preflight.ts
- test/e2e/brev-e2e.test.js
- test/install-preflight.test.js
✅ Files skipped from review due to trivial changes (1)
- docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (3)
- README.md
- spark-install.md
- bin/lib/onboard.js
```diff
  // Pre-cleanup: delete any leftover instance with the same name.
  // This can happen when a previous run's create succeeded on the backend
  // but the CLI got a network error (unexpected EOF) before confirming,
  // then the retry/fallback fails with "duplicate workspace".
  try {
-   brev("delete", INSTANCE_NAME);
+   deleteBrevInstance(INSTANCE_NAME);
    console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
```
Don't lose cleanup state on partial instance lifecycle failures.
deleteBrevInstance() returns false when the instance still exists, but the pre-cleanup branch treats any return as success. Also, instanceCreated is only set after runLocalDeploy() returns; if deploy creates the VM and then fails during remote bootstrap, afterAll() skips deletion and leaks the instance.
Suggested fix
try {
- deleteBrevInstance(INSTANCE_NAME);
- console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+ const hadLeftover = hasBrevInstance(INSTANCE_NAME);
+ if (hadLeftover && !deleteBrevInstance(INSTANCE_NAME)) {
+ throw new Error(`Failed to delete leftover instance "${INSTANCE_NAME}" before test start`);
+ }
+ if (hadLeftover) {
+ console.log(`[${elapsed()}] Deleted leftover instance "${INSTANCE_NAME}"`);
+ }
} catch {
// Expected — no leftover instance exists
}
if (TEST_SUITE === "deploy-cli") {
console.log(`[${elapsed()}] Running nemoclaw deploy end to end...`);
- runLocalDeploy(INSTANCE_NAME);
- instanceCreated = true;
+ try {
+ runLocalDeploy(INSTANCE_NAME);
+ } finally {
+ instanceCreated = hasBrevInstance(INSTANCE_NAME);
+ }
try {
brev("refresh");
    } catch {

Also applies to: 290-299
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/brev-e2e.test.js` around lines 279 - 285, The pre-cleanup
incorrectly assumes deleteBrevInstance(INSTANCE_NAME) succeeded; change the
pre-cleanup to check the boolean result and throw or retry/log an error when it
returns false so we don't treat a failed delete as success. Also ensure
instanceCreated is set as soon as the VM is actually created (either by
returning a created flag from runLocalDeploy or by setting instanceCreated =
true immediately after detecting creation) so that afterAll() will always
attempt deletion even if later bootstrap steps fail; update runLocalDeploy usage
and afterAll cleanup logic to rely on that explicit created flag.
Actionable comments posted: 2
🧹 Nitpick comments (5)
docs/deployment/deploy-to-remote-gpu.md (2)
115-115: Use active voice instead of passive.

"is not set" is a passive construction. Rewrite to use active voice.
Suggested revision:
- "If you do not set `CHAT_UI_URL` on a headless host, the compatibility wrapper prints a warning."

As per coding guidelines: "Active voice required. Flag passive constructions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/deployment/deploy-to-remote-gpu.md` at line 115, The sentence uses passive voice ("is not set"); change it to active voice by addressing the reader and the actor—reword the line mentioning `CHAT_UI_URL` so it reads like "If you do not set `CHAT_UI_URL` on a headless host, the compatibility wrapper prints a warning." Update the occurrence that references `CHAT_UI_URL` and "compatibility wrapper" to this active-voice phrasing.
30-30: Use active voice instead of passive.

"has already been onboarded" is passive. Rewrite to make the action clearer.
Suggested revision:
- "If you have already onboarded your Brev instance with a sandbox, start with the standard sandbox chat flow:"
As per coding guidelines: "Active voice required. Flag passive constructions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/deployment/deploy-to-remote-gpu.md` at line 30, Replace the passive phrase in the sentence "If your Brev instance is already up and has already been onboarded with a sandbox, start with the standard sandbox chat flow:" with an active-voice construction; specifically change it to "If you have already onboarded your Brev instance with a sandbox, start with the standard sandbox chat flow:" so the action ("you have onboarded") is explicit and follows the active-voice guideline.

docs/reference/troubleshooting.md (1)
135-136: Consider rewording to avoid successive sentences starting with "If."

Both sentences begin with "If," which slightly reduces readability. You could combine them or restructure the second sentence.
For example:
- "If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. Switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding when onboarding or sandbox lifecycle fails."
Or introduce variation:
- "If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only. Should onboarding or sandbox lifecycle fail, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/troubleshooting.md` around lines 135 - 136, Reword the second sentence in the paragraph that currently begins "If onboarding or sandbox lifecycle fails..." to avoid repeating "If" after the first sentence ("If you are using Podman, NemoClaw warns and continues, but OpenShell officially documents Docker-based runtimes only."); either combine the two into one sentence or replace the second "If" with an alternative phrasing such as "Should onboarding or sandbox lifecycle fail, switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding" or move the instruction earlier ("Switch to Docker Desktop, Colima, or Docker Engine and rerun onboarding if onboarding or sandbox lifecycle fails") so the paragraph reads smoothly without successive "If" starters.

test/install-preflight.test.js (1)
571-599: Remove the duplicate `docker` stub write to avoid dead setup.

`path.join(fakeBin, "docker")` is written twice; the second stub fully overrides the first one. Keeping only the failing `docker info` stub makes intent clearer.

♻️ Proposed cleanup

```diff
- writeExecutable(
-   path.join(fakeBin, "docker"),
-   `#!/usr/bin/env bash
-if [ "$1" = "info" ]; then
-  echo '{"ServerVersion":"29.3.1","OperatingSystem":"Ubuntu 24.04","CgroupVersion":"2"}'
-  exit 0
-fi
-exit 0
-`,
- );
  writeExecutable(
    path.join(fakeBin, "docker"),
    `#!/usr/bin/env bash
if [ "$1" = "info" ]; then
  exit 1
fi
exit 0
`,
  );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 571 - 599, There are two writeExecutable calls targeting the same path (path.join(fakeBin, "docker")) so the second overrides the first; remove the earlier successful docker stub and keep only the failing docker-info stub so the test setup reflects the intended failure scenario; update/remove the duplicate writeExecutable for path.join(fakeBin, "docker") and ensure the openshell writeExecutable remains unchanged.test/e2e/brev-e2e.test.js (1)
291-304: Extract the new deploy bootstrap branch out of `beforeAll()`.

Adding a third provisioning mode here makes an already large setup callback harder to lint and debug. Please move the `deploy-cli`, launchable, and bare-instance bootstrap flows into dedicated helpers and keep `beforeAll()` to routing plus shared assertions only. As per coding guidelines, `**/*.{js,ts,tsx}`: Enforce cyclomatic complexity limit of 20 (ratcheting down to 15) via ESLint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/brev-e2e.test.js` around lines 291 - 304, The beforeAll() block has grown too large due to the inline "deploy-cli" provisioning flow; extract that branch into a dedicated helper (e.g., function bootstrapDeployCli or setupDeployCli) so beforeAll() only routes and calls shared assertions. Move the code that handles TEST_SUITE === "deploy-cli" — the console logs, instanceCreated=true, runLocalDeploy(INSTANCE_NAME), the brev("refresh") try/catch, waitForSsh(), computing remoteHome via ssh("echo $HOME") and setting remoteDir — into the new helper, update beforeAll() to call that helper when TEST_SUITE === "deploy-cli", and ensure any variables it needs/returns (instanceCreated, remoteDir) are passed or returned so tests keep the same behavior.
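The suggested extraction can be sketched like this. The helper names and returned state are hypothetical, since the full test file is not shown here; the point is that `beforeAll()` shrinks to a routing call.

```javascript
// Hypothetical per-suite bootstrap helpers; each returns the state the
// tests need instead of mutating shared variables inside beforeAll().
async function bootstrapDeployCli(instanceName) {
  // would call runLocalDeploy, brev refresh, waitForSsh, resolve remoteDir...
  return { instanceCreated: true, remoteDir: "/home/ubuntu/nemoclaw" };
}

async function bootstrapLaunchable(instanceName) {
  return { instanceCreated: true, remoteDir: "/home/ubuntu/nemoclaw" };
}

const bootstrappers = {
  "deploy-cli": bootstrapDeployCli,
  launchable: bootstrapLaunchable,
};

// beforeAll() reduces to this: route on TEST_SUITE, fail fast on unknowns.
async function bootstrap(testSuite, instanceName) {
  const fn = bootstrappers[testSuite];
  if (!fn) throw new Error(`unknown TEST_SUITE: ${testSuite}`);
  return fn(instanceName);
}
```

Each helper stays individually small, which keeps the per-function cyclomatic complexity under the ESLint limit the guideline cites.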
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/brev-e2e.test.js`:
- Around line 76-80: The helper listBrevInstances currently swallows all
exceptions and returns [] which masks CLI failures; change listBrevInstances to
propagate errors (do not catch-and-return empty) so callers can distinguish real
"no instances" from lookup/auth/parse failures, and update deleteBrevInstance
and hasBrevInstance to either catch and surface that error or implement a
retry-on-transient-failure strategy before deciding to treat the instance as
absent; reference the functions listBrevInstances(), hasBrevInstance(), and
deleteBrevInstance() when making these changes.
In `@test/install-preflight.test.js`:
- Around line 752-755: The test "requires explicit terms acceptance" currently
expects result.status to be 0 but CI shows the installer enforces the terms gate
and exits with non-zero; update the assertion to expect result.status toBe(1)
(change expect(result.status).toBe(0) to expect(result.status).toBe(1)) and keep
or add an assertion that `${result.stdout}${result.stderr}` contains the
terms-acceptance message/variable (e.g.,
/--yes-i-accept-third-party-software|NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1/) so
the test aligns with the enforced terms gate.
---
Nitpick comments:
In `@docs/deployment/deploy-to-remote-gpu.md`:
- Line 115: The sentence uses passive voice ("is not set"); change it to active
voice by addressing the reader and the actor—reword the line mentioning
`CHAT_UI_URL` so it reads like "If you do not set `CHAT_UI_URL` on a headless
host, the compatibility wrapper prints a warning." Update the occurrence that
references `CHAT_UI_URL` and "compatibility wrapper" to this active-voice
phrasing.
- Line 30: Replace the passive phrase in the sentence "If your Brev instance is
already up and has already been onboarded with a sandbox, start with the
standard sandbox chat flow:" with an active-voice construction; specifically
change it to "If you have already onboarded your Brev instance with a sandbox,
start with the standard sandbox chat flow:" so the action ("you have onboarded")
is explicit and follows the active-voice guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5a4e6183-9de9-418c-a7ee-9a357ba718fd
📒 Files selected for processing (10)
- docs/deployment/deploy-to-remote-gpu.md
- docs/reference/commands.md
- docs/reference/troubleshooting.md
- scripts/brev-setup.sh
- scripts/install.sh
- spark-install.md
- src/lib/deploy.ts
- src/lib/preflight.ts
- test/e2e/brev-e2e.test.js
- test/install-preflight.test.js
✅ Files skipped from review due to trivial changes (3)
- scripts/install.sh
- spark-install.md
- src/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/brev-setup.sh
- src/lib/deploy.ts
```js
function listBrevInstances() {
  try {
    return JSON.parse(brev("ls", "--json"));
  } catch {
    return [];
```
Don't mask Brev lookup failures as "instance absent".
listBrevInstances() catches all exceptions (including CLI errors, parse failures, and transient auth issues) and returns [], making hasBrevInstance() and deleteBrevInstance() unable to distinguish a lookup failure from a non-existent instance. This means if brev ls --json fails transiently during pre-cleanup or afterAll(), both will report success while the VM remains running. Keep lookup errors distinct from "not found"—throw or retry from deleteBrevInstance() rather than converting them to an empty result.
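One way to implement the suggested distinction is sketched below. This is illustrative only: `brev` is passed in as a stand-in for the test file's CLI wrapper so the retry logic can be exercised without the real CLI.

```javascript
// Stand-in for the suite's brev() wrapper, injected for testability.
function listBrevInstances(brev) {
  // No catch: CLI, auth, and parse failures propagate to the caller.
  return JSON.parse(brev("ls", "--json"));
}

function hasBrevInstance(brev, name, { retries = 3 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return listBrevInstances(brev).some((inst) => inst.name === name);
    } catch (err) {
      lastError = err; // transient failure: retry instead of reporting "absent"
    }
  }
  // All attempts failed: surface the error rather than masking it as [].
  throw lastError;
}
```

With this shape, a caller like `deleteBrevInstance()` only skips cleanup when the listing genuinely contains no matching instance; a persistent lookup failure becomes a visible test error instead of a leaked VM.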
🧹 Nitpick comments (1)
test/install-preflight.test.js (1)
772-775: Consider tightening the exit status assertion.

The test name "requires explicit terms acceptance" implies the installer should fail without acceptance, yet the assertion accepts both exit codes 0 and 1. Per `checkAcceptanceAndPromptNonInteractive` in `bin/lib/onboard.js`, non-acceptance in non-interactive mode should call `process.exit(1)`.

If exit 0 is a valid outcome (e.g., installation succeeds but onboarding is skipped), consider updating the test name to reflect that nuance. Otherwise, the assertion should enforce `expect(result.status).toBe(1)` to properly verify the "requires" behavior.

The test still validates that onboarding doesn't run (`onboardLog` doesn't exist) and the acceptance message is shown, so the behavior is partially verified.

♻️ Stricter assertion if non-zero exit is expected

```diff
- expect([0, 1]).toContain(result.status);
+ expect(result.status).toBe(1);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 772 - 775, The test allows both exit codes 0 and 1 but should enforce the expected failure on missing acceptance; update the assertion in test/install-preflight.test.js to assert result.status is 1 (replace expect([0, 1]).toContain(result.status) with expect(result.status).toBe(1)) so it matches the behavior of checkAcceptanceAndPromptNonInteractive in bin/lib/onboard.js that calls process.exit(1) when acceptance is not provided, or alternatively rename the test to clarify that exit 0 is acceptable if you intentionally intend to allow skipping onboarding.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e5a67a8c-af40-4186-be94-1bb158dc3f41
📒 Files selected for processing (1)
test/install-preflight.test.js
Summary
- Port the legacy `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling
- Deprecate the legacy provisioning paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow
What Changed
- New `src/lib/preflight.ts`
- New `src/lib/deploy.ts`
- Updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, an explicit GCP provider default, stricter readiness checks, and the standard installer/onboard flow
- Deprecated `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper
Validation
- `npm run build:cli`
- `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js`
- `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb` reached `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`
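The readiness states listed above can be checked with a polling loop of roughly this shape. This is a sketch only: `listInstances` is an injectable stand-in for parsing `brev ls --json` output, and the retry/delay parameters are illustrative.

```javascript
// Poll until the named instance reports all three ready states, or give up.
async function waitForReady(listInstances, name, { attempts = 3, delayMs = 10 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const inst = listInstances().find((x) => x.name === name);
    if (
      inst &&
      inst.status === "RUNNING" &&
      inst.build_status === "COMPLETED" &&
      inst.shell_status === "READY"
    ) {
      return inst; // instance is fully ready
    }
    // Not ready yet: wait before the next poll.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`instance ${name} not ready`);
}
```

Requiring all three fields, rather than `status=RUNNING` alone, is what makes the readiness check "stricter": a VM can be running while its build or shell is still provisioning.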
Credit / Prior Work
This branch builds on ideas and prior work from:
Summary by CodeRabbit
New Features
Improvements
`nemoclaw onboard`.
Deprecations
`nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths.
Tests